ETL for LLMs, Integrating and Normalizing Unstructured Data | ep 13

Update: 2024-06-19

Description

In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs).

Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns.

"I think people are trying to optimize around the chunking strategy... But for me, that seems a bit maybe not focusing on the right area of optimization. These embedding models themselves have gone just like, so much more advanced over the past five to 10 years that regardless of what representation you're passing in, they do a pretty good job of being able to understand that information semantically and returning the relevant chunks." - Derek Tu on the importance of embedding models over chunking strategies

"If you are cost conscious and if you're worried about performance, I would definitely look at quantizing your embeddings. I think we've probably been able to, I don't have like the exact numbers here, but I think we might be saving at least half, right, in storage costs by quantizing everything." - Derek Tu on optimizing costs and performance with vector databases

Derek Tu: Carbon

Nicolay Gerold: LinkedIn, X (Twitter)

Key Takeaways:

Understand your data sources: Before building your ETL pipeline, thoroughly assess the various data sources you'll be working with, such as Slack, Email, Google Docs, and more. Consider the unique characteristics of each source, including data format, structure, and metadata.

Normalize and preprocess data: Develop strategies to normalize and preprocess the unstructured data from different sources. This may involve parsing, cleaning, and transforming the data into a standardized format that can be easily consumed by your AI models.

Experiment with chunking strategies: While there's no one-size-fits-all approach to chunking, it's essential to experiment with different strategies to find what works best for your specific use case. Consider factors like data format, structure, and the desired granularity of the chunks.

Leverage metadata and tagging: Metadata and tagging can play a crucial role in organizing and retrieving relevant data for your AI models. Implement mechanisms to capture and store important metadata, such as document types, topics, and timestamps, and consider using AI-powered tagging to automatically categorize your data.

Choose the right embedding model: Embedding models have advanced significantly in recent years, so focus on selecting the right model for your needs rather than over-optimizing chunking strategies. Consider factors like model performance, dimensionality, and compatibility with your data types.

Optimize vector database usage: When working with vector databases, consider techniques like quantization to reduce storage costs and improve performance. Experiment with different configurations and settings to find the optimal balance for your specific use case.
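To make the takeaways above a bit more concrete, here is a minimal Python sketch of the flow they describe: normalize a source payload into one shared shape, split it into overlapping chunks that carry their metadata, and quantize an embedding vector to 8-bit integers. It is not Carbon's implementation, and the episode does not say which quantization scheme Carbon uses. The `Document` class, the function names, the payload fields (`body`, `content`, `ts`), and the chunk sizes are all hypothetical and chosen only for illustration. Scalar int8 quantization is one common option among several.

```python
from array import array
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical normalized record shared by every data source."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def normalize(source: str, raw: dict) -> Document:
    """Map a source-specific payload onto the shared Document shape.

    The payload keys ("body", "content", "ts") are assumptions for this sketch.
    """
    text = raw.get("body") or raw.get("content") or ""
    text = " ".join(text.split())  # collapse runs of whitespace and newlines
    return Document(source=source, text=text, metadata={"timestamp": raw.get("ts")})


def chunk(doc: Document, size: int = 200, overlap: int = 50) -> list[Document]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(doc.text):
        piece = doc.text[start:start + size]
        # Each chunk keeps the parent metadata plus its own character offset.
        chunks.append(Document(doc.source, piece, {**doc.metadata, "offset": start}))
        if start + size >= len(doc.text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def quantize_int8(vec: list[float]) -> tuple[array, float]:
    """Scalar quantization: scale a non-empty vector into the int8 range."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid a zero scale
    return array("b", (round(x / scale) for x in vec)), scale


def dequantize(qvec: array, scale: float) -> list[float]:
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in qvec]
```

In this sketch each dimension is stored in 1 byte instead of the 4 bytes a 32-bit float needs, at the price of a rounding error of at most half the scale per dimension. Whether that trade-off is acceptable depends on the embedding model and the retrieval quality you measure on your own data.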

00:00 Introduction and Optimizing Embedding Models

03:00 The Evolution of Carbon and Focus on Unstructured Data

06:19 Customer Progression and Target Group

09:43 Interesting Use Cases and Handling Different Data Representations

13:30 Chunking Strategies and Normalization

20:14 Approach to Chunking and Choosing a Vector Database

23:06 Tech Stack and Recommended Tools

28:19 Future of Carbon: Multimodal Models and Building a Platform

Keywords: Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises